ISSS608-VAA
  • Hands-on Exercise
    • Hands-on Exercise 1 - DataVis with ggplot2
    • Hands-on Exercise 2 - Beyond ggplot2 Fundamentals
    • Hands-on Exercise 3 - Interactive Data Visualisation
    • Hands-on Exercise 4 - Visual Statistical Analysis
    • Hands-on Exercise 4 - Visualising Uncertainty
    • Hands-on Exercise 4 - Building Funnel Plot with R
    • Hands-on Exercise 5 - Building Ternary Plot with R
    • Hands-on Exercise 5 - Visualising Correlation Matrices with R
    • Hands-on Exercise 5 -Building Heatmap for Visualising and Analysing Multivariate Data
    • Hands-on Exercise 5 - Building Parallel Coordinates Plot with R
  • In-class Exercise
    • In-class Exercise 1
    • In-class Exercise 2
    • In-class Exercise 3
    • In-class Exercise 4
    • In-class Exercise 5
    • In-class Exercise 6
    • In-class Exercise 7
    • In-class Exercise 8
    • In-class Exercise 9
    • In-class Exercise 10
  • Take-home Exercise
    • Take-home Exercise 1
    • Take-home Exercise 2
  • Home
  • About

On this page

  • 1. Overview
  • 2. Data
  • 3. Get Started
    • 3.1 Import Libraries
    • 3.2 Load Data
    • 3.3 Data Wrangling
      • 3.3.1 Data filtering
      • 3.3.2 Formatting renaming leases
      • 3.3.3 Derive resale unit price Per Square Meter (PSM)
      • 3.3.4 Preview of Data
  • 4. Factors impact on the trends of resale unit price
    • 4.1 Timing (e.g., Month)
    • 4.2 Location (e.g., Town)
    • 4.3 Model of Flat (e.g., flat model)
    • 4.4 Age of the flat (e.g., Remaining lease years)
    • 4.5 Sizing of the flat (e.g., flat area)
    • 4.6 View sight of the flat (e.g., storey range of the flat)
  • 5. Oneway ANOVA Test
    • 5.1 One-way ANOVA test on resale unit price by flat type
    • 5.2 One-way ANOVA test on resale unit price by flat model
    • 5.3 Correlation Matrix on flat area, remaining leases, resale price, floor range and unit price (psm)
  • 6. Model Diagnostic
  • 7. Summary
  • 8. Reflaction

Exercise 3: Uncover patterns of resale price of HDB in SG

Author

Hou Tao

Published

February 5, 2023

Modified

February 14, 2023

1. Overview

In this take-home exercise 3, we’re going to use visual analytics techniques from R to discover the patterns of resale prices of public housing property within year 2022.

2. Data

Data set Resale flat prices based on registration date from Jan-2017 onwards form from Department of Statistics, Singapore to be used in this exercise.

3. Get Started

3.1 Import Libraries

pacman::p_load(performance, parameters, ggdist, gganimate, ungeviz, plotly, tidyverse)

3.2 Load Data

df <- read_csv("data/resale-flat-prices-2017-onwards.csv", show_col_types = FALSE)

3.3 Data Wrangling

3.3.1 Data filtering

This analysis is going to focus on the flat type from 3 ROOMs to 5 ROOMs within year 2022.

mutated_df <- df %>%
  filter(flat_type == "3 ROOM" | 
         flat_type == "4 ROOM" | 
         flat_type == "5 ROOM") %>%
  mutate(year = as.character(substring(month, 0, 4))) %>%
  mutate(month = as.character(substring(month, 6, 7))) %>%
  mutate(floor_range = as.numeric(substring(storey_range, 6, 8))) %>%
  filter(year == 2022)

3.3.2 Formatting renaming leases

The remaining leases is in string format, data conversion from string to numeric format for better data analytics.

format_remaining_lease <- function(t) {
  t <- str_replace_all(t, "years", ".")
  t <- str_replace_all(t, "year", ".")
  t <- str_replace_all(t, "months", "")
  t <- str_replace_all(t, "month", "")
  t <- str_replace_all(t, " ", "")
  return(round(as.numeric(t), digits = 1))
}
mutated_df$remaining_lease <- 
  format_remaining_lease(mutated_df$remaining_lease)

3.3.3 Derive resale unit price Per Square Meter (PSM)

In this study, Resale unit price per square meter (PSM) is going to be applied for subsequent resale price pattern analysis, the PSM is computed based on the floor area and the unit resale price.

mutated_df$psm <- round(mutated_df$resale_price / mutated_df$floor_area_sqm, 
                        digits = 2)

ds <- mutated_df

3.3.4 Preview of Data

Data previous for first three row from the data.

head(ds, n = 3)
# A tibble: 3 × 14
  month town       flat_…¹ block stree…² store…³ floor…⁴ flat_…⁵ lease…⁶ remai…⁷
  <chr> <chr>      <chr>   <chr> <chr>   <chr>     <dbl> <chr>     <dbl>   <dbl>
1 01    ANG MO KIO 3 ROOM  320   ANG MO… 07 TO …      73 New Ge…    1977      54
2 01    ANG MO KIO 3 ROOM  225   ANG MO… 07 TO …      67 New Ge…    1978      55
3 01    ANG MO KIO 3 ROOM  331   ANG MO… 07 TO …      68 New Ge…    1981      58
# … with 4 more variables: resale_price <dbl>, year <chr>, floor_range <dbl>,
#   psm <dbl>, and abbreviated variable names ¹​flat_type, ²​street_name,
#   ³​storey_range, ⁴​floor_area_sqm, ⁵​flat_model, ⁶​lease_commence_date,
#   ⁷​remaining_lease

4. Factors impact on the trends of resale unit price

4.1 Timing (e.g., Month)

First of all, we first to analyzing the trends of the resale price changes for every month, it is shown that monthly resale price is moving upwards trends.

ds42 <- ds %>%
  group_by(month) %>%
  summarise(n = n(), 
            average = mean(psm), 
            sd = sd(psm), 
            .groups = "drop") %>%
  mutate(se = sd / sqrt(n - 1)) %>%
  mutate(min_3std = round(average - se * 3, digits = 3)) %>%
  mutate(max_3std = round(average + se * 3, digits = 3))
  
hds42 <- highlight_key(ds42)

p2 <- ggplot(data = hds42, 
                aes(x = month, 
                    y = average, 
                    text=c(paste0("min: ", 
                                  min_3std , 
                                  "\nmax: ", 
                                  max_3std)))) +
  
  geom_errorbar(aes(x = month,  
                    ymin = min_3std,  
                    ymax = max_3std), 
                width=0.2, 
                colour="skyblue", 
                alpha=0.9) +
  geom_point(aes(x = month, y = average), 
             stat="identity", 
             color="skyblue", 
             size = 1, 
             alpha=1) +
  
  ggtitle("Trends of Resale Unit Price per Month") +
  
  labs(x = "Month", 
       y = "Price per Square Meter (SGD)") +
  
  theme_bw()+
  theme(axis.text.x = element_text(angle = 80, 
                                   vjust = 0.5, 
                                   hjust=1)) 



p2 <- ggplotly(p2, width = 620, height = 400)

p2 <- highlight(ggplotly(p2), off="plotly_doubleclick")

p2

4.2 Location (e.g., Town)

Upon understand the trends of resale unit price goes high, we now try to see the price of each town, it is shown that the central area has high resale unit price followed by Queenstown. it turns out that the location of the town has significant impact on resale unit price in Singapore.

p1 <- ds %>%
  mutate(town = fct_reorder(town, desc(psm), 
                            .fun = "mean")) %>%
  ggplot(aes(x = town, y = psm)) +
  geom_violin(fill = "skyblue", linewidth = 0.1, 
              draw_quantiles = c(0.25, 0.5, 0.75)) + 
  stat_summary(geom = "point", 
               fun = "mean", 
               colour = "navy", 
               size = 0.5) + 
  xlab("Town Area") +
  ylab("Price per Square Meter (SGD)") +
  ggtitle("Trends of Resale Unit Price per Town")+
  theme_bw() + 
  theme(legend.position = "top", 
        legend.box.just = "top",
        legend.box = "horizontal",
        legend.direction = "horizontal",
        axis.text.x = element_text(angle = 55, 
                                   hjust=0.2))
  
p1 <- ggplotly(p1, width = 780, height = 600)


p1

4.3 Model of Flat (e.g., flat model)

The chart below shows that both float models are also matters.

It is shown that Type S1, Type S2, DBSS, Premium Apartment, and Preimum Apartment Loft and Terrace are having dominant advantages over flat models such as Adjoined flat, Model A2, Model A-Maisonette.

It also reveal that the location still matters in resale price, for example, Bukit Merah are more expensive than Queenstown on the same flat type “Improved” even the storey range of Queenstown is higher than Bukit Merah.

ds43 <- ds %>% 
    group_by(storey_range, flat_model, town, remaining_lease) %>%
    summarise(n = n(), 
              average = mean(psm), 
              sd = sd(psm), 
              .groups = "drop")

p3 <- ggplot(data = ds43, aes(x = flat_model, 
                            y = average, 
                            color = town,
                            shape = storey_range)) +
  geom_point(size = 0.7) +
  scale_color_discrete(name = "Town") +
  scale_shape_discrete(name = "Storey Range") +
  ggtitle("Resale Unit Price vs Flat Model vs Storey Range vs Town") +
  xlab("Flat Model") +
  ylab("Price per Square Meter (SGD)")+
  theme_bw()+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 35, 
                                   hjust=0.2)) 

p3 <- ggplotly(p3, width = 780, height = 480) %>%
  layout(legend = list(orientation = "h"))

p3

4.4 Age of the flat (e.g., Remaining lease years)

The chart below shows that new flat are dominant by Model A, Preimum Apartment, DBSS, those flat are easier to have a better resale unit price than Standard, Improved and New Genernation flat model in Singapore.

ds43 <- ds %>% 
    group_by(storey_range, flat_model, town, remaining_lease) %>%
    summarise(n = n(), 
              average = mean(psm), 
              sd = sd(psm), 
              .groups = "drop")

p3 <- ggplot(data = ds43, aes(x = remaining_lease, 
                            y = average, 
                            color = flat_model,
                            shape = storey_range)) +
  geom_point(size = 0.7) +
  scale_color_discrete(name = "Town") +
  scale_shape_discrete(name = "Storey Range") +
  ggtitle("Resale Unit Price vs Remaning Lease Years vs Town") +
  xlab("Remaning Lease Years") +
  ylab("Price per Square Meter (SGD)")+
  theme_bw()+
  theme(axis.text.x = element_text(vjust = -1),
        legend.position = "none")

p3 <- ggplotly(p3, width = 780, height = 480) %>%
  layout(legend = list(orientation = "h"))

p3

4.5 Sizing of the flat (e.g., flat area)

The chart below shows that resale price is not directly has linearship relationship versus floor area, it is still dominant by the location, storey range by remaining years.

ds44 <- ds %>% 
    group_by(floor_area_sqm, town, storey_range) %>%
    summarise(n = n(), 
              resale_price = mean(resale_price), 
              sd = sd(resale_price), 
              .groups = "drop")

p4 <- ggplot(data = ds44, aes(x = floor_area_sqm, 
                            y = resale_price, 
                            color = town,
                            shape = storey_range)) +
  geom_point(size = 0.7) +
  scale_color_discrete(name = "Flat Model") +
  ggtitle("Trends of Resale price vs Floor Area vs Storey Range vs Town") +
  xlab("Floor Area (PSM)") +
  ylab("Resale Price (SGD)")+
  theme_bw()+
  theme(axis.text.x = element_text(vjust = -1),
        legend.position = "none") 

p4 <- ggplotly(p4, width = 780, height = 480) %>%
  layout(legend = list(orientation = "h"))

p4

4.6 View sight of the flat (e.g., storey range of the flat)

The chart below reveal that the price has positive relationship versus the storey range of flat, higher storey always having higher price in singapore.

p4 <- ds %>%
  
  mutate(storey_range = 
           fct_reorder(storey_range, desc(psm), 
                            .fun = "mean")) %>%
  
  ggplot(aes(x = storey_range, y = psm)) +
  
  stat_dist_gradientinterval(fill_type = "segments",
                             show.legend = TRUE,
                             fill = "skyblue")+
  
  geom_hpline(data = sampler(10, group = flat_model), 
              color = "#D55E00", size = 0.3) +
  
  theme_bw()+
  
  theme(axis.text.x = element_text(angle = 80, 
                                   vjust = 0.3, 
                                   hjust=0.1), 
        legend.position = "right") +
  
  transition_states(.draw, 1, 3) +
  
  labs(title = "Resale unit price per Storey Range", 
       ylab = "Price per Square Meter (SGD)",
       xlab = "Storey Range")

p4

5. Oneway ANOVA Test

5.1 One-way ANOVA test on resale unit price by flat type

One-Way ANOVA test on the resale unit price per flat type is able help to understand the signficance of resale unit price vs flat type, it turn out the flat type does slightly difference among each flat type.

ggstatsplot::ggbetweenstats(
  data = ds,
  x = flat_type, 
  y = psm,
  type = "np",
  mean.ci = TRUE, 
  pairwise.comparisons = TRUE, 
  pairwise.display = "s",
  p.adjust.method = "fdr",
  messages = TRUE,
  xlab = "Flat Type",
  ylab = "Price per Square Meter (SGD)",
  title = "One-way ANOVA test on resale unit price by flat type"
)

5.2 One-way ANOVA test on resale unit price by flat model

To further understand the resale unit price on the flat itself, one-way ANOVA test is conducted on the flat model. it is tested and proved that the flat mode will having impact on the price.

ggstatsplot::ggbetweenstats(
  data = ds,
  x = flat_model, 
  y = psm,
  type = "np",
  mean.ci = FALSE, 
  pairwise.comparisons = FALSE, 
  pairwise.display = "s",
  p.adjust.method = "fdr",
  messages = FALSE,
  xlab = "Storey Range",
  ylab = "Price per Square Meter (SGD)",
  title = "One-way ANOVA test on resale unit price by storey range"
)+
  theme_bw()+
  theme(axis.text.x = 
          element_text(
            angle = 80, 
            vjust = 0.3, 
            hjust=0.1))

5.3 Correlation Matrix on flat area, remaining leases, resale price, floor range and unit price (psm)

We also try to analysis the correlations among flat area, remaining leases, resale price, floor range and unit price (psm), chat shown that the PSM having positive relationship with floor range and more ramining years of flat will have better resale price.


ggstatsplot::grouped_ggcorrmat(
  data = ds,
  cor.vars = c(6, 9, 10, 12, 13),
  grouping.var = flat_type,
  type = "np",
  p.adjust.methods = "holm",
  plotgrid.args = list(ncol = 3),
  ggcorrplot.args = list(outline.color = "black", 
                         hc.order = TRUE,
                         tl.cex = 10),
  annotation.args = list(
    tag_levels = "a",
    title = "Correlogram for Resale Price",
    subtitle = "The measures are: flat area, remaining leases, resale price, floor range and unit price (psm)"
  )
)

6. Model Diagnostic

upon all factors analyized using EDA approach, we could also model the resale price factors using models in R.

Resale price of flat can be modelled based on all factors - town area, flat type, storey range, floor area, flat model, remaining lease and unit price (psm). then, output the model for analysis.

model <- lm(data = ds, 
              resale_price ~ town + flat_type + storey_range + 
              floor_area_sqm + flat_model + remaining_lease + psm)

model

Call:
lm(formula = resale_price ~ town + flat_type + storey_range + 
    floor_area_sqm + flat_model + remaining_lease + psm, data = ds)

Coefficients:
                     (Intercept)                         townBEDOK  
                      -4.719e+05                        -4.529e+03  
                      townBISHAN                   townBUKIT BATOK  
                       1.108e+04                        -2.233e+02  
                 townBUKIT MERAH                 townBUKIT PANJANG  
                      -4.589e+03                        -5.223e+03  
                 townBUKIT TIMAH                  townCENTRAL AREA  
                       5.401e+03                        -3.299e+04  
               townCHOA CHU KANG                      townCLEMENTI  
                      -1.094e+04                        -8.843e+03  
                     townGEYLANG                       townHOUGANG  
                      -1.588e+03                        -5.733e+03  
                 townJURONG EAST                   townJURONG WEST  
                      -5.762e+03                        -8.958e+03  
             townKALLANG/WHAMPOA                 townMARINE PARADE  
                      -7.554e+03                        -2.905e+03  
                   townPASIR RIS                       townPUNGGOL  
                      -1.592e+04                        -2.167e+03  
                  townQUEENSTOWN                     townSEMBAWANG  
                      -1.353e+04                        -5.486e+03  
                    townSENGKANG                     townSERANGOON  
                      -4.956e+03                        -5.655e+03  
                    townTAMPINES                     townTOA PAYOH  
                      -7.329e+03                         3.431e+03  
                   townWOODLANDS                        townYISHUN  
                      -1.033e+04                        -3.526e+03  
                 flat_type4 ROOM                   flat_type5 ROOM  
                       1.640e+04                         2.064e+04  
            storey_range04 TO 06              storey_range07 TO 09  
                       4.213e+02                         4.351e+02  
            storey_range10 TO 12              storey_range13 TO 15  
                      -2.897e+02                         1.091e+03  
            storey_range16 TO 18              storey_range19 TO 21  
                       2.873e+03                         3.270e+03  
            storey_range22 TO 24              storey_range25 TO 27  
                       4.385e+03                         1.791e+03  
            storey_range28 TO 30              storey_range31 TO 33  
                       2.315e+03                        -1.211e+04  
            storey_range34 TO 36              storey_range37 TO 39  
                      -9.265e+03                        -1.910e+04  
            storey_range40 TO 42              storey_range43 TO 45  
                       8.991e+03                        -4.011e+04  
            storey_range46 TO 48              storey_range49 TO 51  
                       3.583e+04                        -1.300e+03  
                  floor_area_sqm           flat_modelAdjoined flat  
                       5.263e+03                        -2.523e+04  
                  flat_modelDBSS                flat_modelImproved  
                      -1.141e+04                        -1.627e+04  
   flat_modelImproved-Maisonette                 flat_modelModel A  
                      -7.035e+03                        -2.189e+04  
    flat_modelModel A-Maisonette                flat_modelModel A2  
                      -1.621e+00                        -1.591e+04  
        flat_modelNew Generation       flat_modelPremium Apartment  
                      -1.178e+04                        -1.988e+04  
flat_modelPremium Apartment Loft              flat_modelSimplified  
                       4.089e+04                        -2.085e+04  
              flat_modelStandard                 flat_modelTerrace  
                      -1.105e+04                        -8.207e+02  
               flat_modelType S1                 flat_modelType S2  
                       1.940e+04                         8.417e+04  
                 remaining_lease                               psm  
                      -1.693e+02                         9.340e+01  

Check correlation of model attributes based on the model constructed.

check_collinearity(model)
# Check for Multicollinearity

Low Correlation

 Term  VIF   VIF 95% CI Increased SE Tolerance Tolerance 95% CI
  psm 5.89 [5.76, 6.03]         2.43      0.17     [0.17, 0.17]

Moderate Correlation

           Term   VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
 floor_area_sqm 17.85 [17.42, 18.29]         4.23      0.06     [0.05, 0.06]
     flat_model 32.99 [32.18, 33.81]         5.74      0.03     [0.03, 0.03]

High Correlation

            Term   VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
    storey_range  2.50   [2.45, 2.55]         1.58      0.40     [0.39, 0.41]
 remaining_lease  5.77   [5.64, 5.90]         2.40      0.17     [0.17, 0.18]
            town 19.76 [19.29, 20.25]         4.45      0.05     [0.05, 0.05]
       flat_type 36.24 [35.36, 37.15]         6.02      0.03     [0.03, 0.03]

Perform model diagnostic through complete check for models

check_model(model)

6.2 checking for multicolinearity for refined model

Revise the model to remove high correlated variables and based on the correlation analysis and perform model diagnostic through complete check for models.

rev_model <- lm(data = ds, resale_price ~ floor_area_sqm + flat_model + psm + remaining_lease)

check_model(rev_model)

6.3 Visualising Regression Parameters

We can visualize the parameters for the regression model below, the model also reveal that Preminum Apartment and Type S1 are having positive impact on the resale price.

plot(parameters(rev_model))

7. Summary

The study of Singapore public housing pricing are having many different factors, such as location, age of flat, flat model, size of flat from overall perspective.

People always got preferences, for example, if a person prefer to living at Ang Mo Kio, or Bishan, then, flat model, age of the flat, and location of the flat within the area, storey of the flat are become next few factors to be concerned when a people plan to buy a flat.

In summary, this exercise with EDA provides better insight for people to understand how to move forward to choose their dream home at reasonable cost in Singapore

8. Reflaction

This exercise shows us that there are two approaches to allow us to analyze the factors with regard to the information of interest. we can have different options based on our target audience and use cases.

  • To public users with less statistical knowledge audience, EDA approach can be adopted to provide a better view and insight into the data through visualization.

  • To develop a statistical model for prediction and forecast, a statistical model approach can be adapted to have various statistical analysis visualization for internal communication and business model development.